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ABSTRACT 

To understand the molecular mechanisms that 
underlie global transcriptional regulation, it is es- 
sential to first identify all the transcriptional regula- 
tory elements in the human genome. The advent 
of next-generation sequencing has provided a 
powerful platform for genome-wide analysis of dif- 
ferent species and specific cell types; when 
combined with traditional techniques to identify 
regions of open chromatin [DNasel hypersensitivity 
(DHS)] or specific binding locations of transcription 
factors [chromatin immunoprecipitation (ChIP)], and 
expression data from microarrays, we become 
uniquely poised to uncover the mysteries of the 
genome and its regulation. To this end, we have per- 
formed global meta-analysis of the relationship 
among data from DNasel-seq, ChlP-seq and expres- 
sion arrays, and found that specific correlations 
exist among regulatory elements and gene expres- 
sion across different cell types. These correlations 
revealed four distinct modes of chromatin domain 
structure reflecting different functions: repressive, 
active, primed and bivalent. Furthermore, CCCTC 
-binding factor (CTCF) binding sites were identified 
based on these integrative data. Our findings un- 
covered a complex regulatory process involving by 
DNasel HS sites and histone modifications, and 
suggest that these dynamic elements may be re- 
sponsible for maintaining chromatin structure and 
integrity of the human genome. Our integrative 
approach provides an example by which data from 
diverse technology platforms may be integrated to 



provide more meaningful insights into global tran- 
scriptional regulation. 

INTRODUCTION 

Consistent proper function of all biologic processes relies 
on the precise spatial and temporal expression of genes 
(1-3). Development, differentiation, proliferation, apop- 
tosis, and even aging, are the culmination of cell 
type-specific and ubiquitous gene expression. Since tran- 
scription was first described researchers have sought to 
define the molecular mechanisms that regulate this phe- 
nomenon, driven by the belief that understanding the gene 
expression profiles of normal and disease states will facili- 
tate discoveries of therapeutic targets to alleviate human 
and animal suffering. These works have defined several 
types of ds-acting transcriptional regulators, including 
promoters, enhancers, insulators and locus control 
regions (LCR) (1,4,5) and the £nms-acting factors that 
bind to them. Nonetheless, the relative roles of these regu- 
latory DNA elements have yet to be fully elucidated. The 
introduction of high-throughput sequencing (6) and its 
massive amounts of data spanning entire genomes of 
species has provided a platform from which we may 
begin to examine global patterns of gene expression and 
compare these patterns among different cell types to gain 
a clearer understanding of the molecular mechanisms 
underlying the dynamic and complex processes of life. 

Next-generation sequencing (NGS) has become a 
popular approach to identifying gene regulatory elements 
and to performing accurate functional analysis (6). NGS 
of DNA-protein complexes isolated by chromatin 
immunoprecipitation, a procedure known as (ChlP-seq), 
has allowed for global localization of regulatory elements 
associated with a specific protein of interest (7-11). 
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Unfortunately, this combined technique is only applicable 
to known (previously characterized) trans-&Q\mg factors 
and is limited by its requirement for a high quality 
ChlP-grade antibody to isolate the transcription factor 
(TF) to be analyzed (9). By coupling the NGS method 
with DNasel hypersensitive (DNasel HS) site mapping 
(long considered the gold-standard for comprehensively 
identifying the location of various classes of transcrip- 
tional regulatory elements), a particularly powerful high- 
resolution procedure, DNase-seq, was developed (12-18). 
Like the ChlP-seq procedure, though, DNase-seq suffers 
from some inherent limitations. DNase-seq provides only 
location data and is unable to directly characterize 
function or identify the particular TF(s) associated with 
the region. 

The data obtained from each of these combined-NGS 
procedures may be analyzed in parallel (along with data 
obtained from gene expression arrays) to facilitate the 
identification of bona fide transcriptional regulatory 
elements. First, though, we must obtain a thorough under- 
standing of the different types of c/s-regulatory sequence 
elements and epigenetic modulatory mechanisms in order 
to accurately investigate their contributions to spatial and 
temporal gene expression. 

The first genome-wide maps of histone methylation (7) 
and acetylation marks (19) were generated from human 
resting CD4 + T cells. Histone modifications associated 
with gene transcription were designated as active, while 
those associated with repressed transcription were desig- 
nated as repressive. Intriguingly, some of the 'active' were 
identified in transcriptionally silent genes (7,20-23), sug- 
gesting that these modifications may act more as markers 
of genes primed for transcriptional activity. Not surpris- 
ingly, then, histone modification is not the sole mediator 
of expression level (24). By performing DNase-seq and 
DNase fragment, hybridization to microarray chip 
(DNase-chip), Boyle et al. (25) created a comprehensive 
genome-wide map of the open chromatin regions in CD4 + 
T cells. Their analysis of the resultant data sets did not 
identify a clear correlation between DNasel HS and levels 
of gene expression. Shortly thereafter, Xi et al. (26) used 
DNase-chip to comparatively analyze six human cell types 
in order to identify functional cell type specific and ubi- 
quitous DNasel HS sites (DHSs). Their examination of 
1 % of the human genome revealed that cell type-specific 
DNasel HS sites co-localized with cell type-specific gene 
expression. Recently, Stitzel et al. (27) conducted genome- 
wide analysis of DNasel hypersensitive sites in human 
islets. Ling et al. (28) produced a set of detailed, high- 
quality, genome-wide DNasel hypersensitivity maps in 
the mouse liver in vivo. These studies highlight the utility 
of DNase-seq for systematically uncovering ds-regulatory 
elements on a genome- wide scale. 

In the study presented herein, we performed a genome- 
wide meta-analysis of DNasel HS sites identified in 29 
different cell types. We sought to determine the relation- 
ship between DNasel HS, histone modifications and gene 
expression. We found that specific correlations exist 
between DNasel HS, gene expression and the amounts 
of active and repressive histone modifications across dif- 
ferent cell types. These correlations displayed four distinct 



modes (repressive, active, bivalent and primed), reflecting 
different functions of the chromatin domains. 
Furthermore, CCCTC binding factor (CTCF) binding 
sites were newly identified based on these integrative 
data. Our findings revealed a situation of complex regula- 
tion of gene expression mediated by DNasel hypersensi- 
tive chromatin regions and their histone modifications. 

MATERIALS AND METHODS 

DNase-seq, ChlP-seq, gene expression and Gencode data 

All data used in this study were freely available for down- 
load from the University of California, Santa Cruz 
(UCSC) NCBI36/hgl8 Genome Browser (http: //genome 
.ucsc.edu/encode/) (29,30). The Open Chromatin (25) 
track was used to obtain DNase-seq data from 29 cell 
lines and ChlP-seq data of CTCF, c-Myc and RNA Pol 
II. The Broad Histone (22) track provided the ChlP-seq 
data of histone modifications from nine cell lines. Regions 
of enriched signal in either DNasel HS or ChIP experi- 
ments were identified using F-Seq (31). The gene annota- 
tions presented herein were taken from the Gencode data 
(32) in the Gencode Genes track (Version 3c, October 
2009). The detailed information of these data was pre- 
sented in Supplementary Table SI. Finally, the 17 751 
UCSC known human genes with expression data were 
obtained from the Gene Expression Omnibus database 
(series GSE15805 and GSE17793; http://www.ncbi.nlm. 
nih.gov/projects/geo/). All chromosome Y data were 
omitted from this study. 

Classification of DNasel HS sites relative to cell types 

Considering the lineage specificity observed with DNasel 
HS sites (Supplementary Figure SI) (26), we classified 
DNasel HS sites according to their occurrence rates in 
29 cell lines. A given DNasel HS site is classified as cell 
type specific, if it does not overlap (Here, overlap between 
two binding sites represents that these two regions have at 
least one common base pair.) with any DNasel HS site 
within other 28 cell lines. A given DNasel HS site is clas- 
sified ubiquitous, if it overlaps with any DNasel HS site 
within all 28 cell lines. The remaining DNasel HS sites are 
classified as common, which are present in two to 28 cell 
lines. The results showed that about one-half of DNasel 
HS sites were found in two cell lines (Supplementary 
Table S2). The cell lines that had the highest DNasel 
HS overlap (74.96%) were the two lymphocyte lines, 
GM12891 and GM12892; the lowest overlap (25.99%) 
was found between AoSMC (aortic smooth muscle) and 
Medullo (Medulloblastoma). This agreed with the cluster- 
ing analysis that we performed on DNasel HS sites 
(Supplementary Figure SI). 

Tag density profiles 

The profiles of tag density at DNasel HS sites or at tran- 
scription start sites (TSSs) were generated as described by 
Wang et al. (19). Briefly, the DNasel HS region was 
examined in 200 bp windows that spanned the 5 kb imme- 
diately upstream of the DNasel HS start site (dxStart) and 



7430 Nucleic Acids Research, 2011, Vol. 39, No. 17 



5kb downstream from the end of the DNasel HS site 
(dxEnd); each window was evaluated for content of 
uniquely mapped ChlP-seq data of histone modifications 
and TFs. The DNasel HS site itself was divided equally 
among 10 windows for detailed analysis. All window tag 
counts were normalized to the total number of bases 
present in the window and to the total read number of 
the given library. 

To plot the profiles of those DNasel HSs associated 
with TSSs, the 17 751 UCSC known genes with expression 
information were categorized among broad groups ac- 
cording to their reported expression levels: high, median 
or mainly silent. One thousand genes were selected per 
group and corresponding DNase-seq data was analyzed 
after each was aligned by their TSS. 

Gene density and TFBSs associated with DNasel HS sites 

The entire genome was scanned by 2 Mb windows. Within 
each window, the number of genes, DNasel HS sites, 
CTCF binding sites, Pol2 and c-Myc binding sites were 
quantitated. Linear regression was used to determine the 
correlation between gene density and binding site density. 

Enrichment/depletion of histone modifications at DNasel 
HS sites 

To quantitatively measure histone modification enrich- 
ment or depletion at DNasel HS sites, we evaluated the 
histone signal based on the profiles of histone modification 
tag density by using the formula: Pq HSs , / = 1, • • • ,10, 
where ^d HSs and ^d HSs wee tag density at dxStart and 
dxEnd, respectively. Enrichment or depletion of histone 
modifications at DNasel HS sites were defined as 

C = (^hss+ J Pdhss)/( J Pdhss+^dhss). where ?>1.0 indi- 
cated enrichment and 0 < f < 1.0 indicated depletion. 

DNasel HS sites associated with histone modifications and 
expression levels 

Cell type specific, common and ubiquitous DNasel 
HS sites were divided among 100 groups according to 
DNasel hypersensitivity. The average tag density of 
DNasel HS and of histone modifications was calculated 
for each group. 

The 17 751 UCSC known genes with expression infor- 
mation were divided into 100 groups, based on expression. 
The average tag densities of each modification and the 
DNasel HS within the gene body were calculated for 
each group, respectively. 

Motif identification 

To carry out motif identification, we examined data of 
DNasel HS sites that encompassed defined ChlP- 
enriched regions in each cell line. CUDA-MEME 
(Version 3.0) (33), which was programmed using hybrid 
CUDA (Compute Unified Device Architecture) based on 
MEME (Version 4.4.0) (34) was used to discover consen- 
sus motifs with default parameters. 

To determine the number of peaks that could be ex- 
plained by statistically significant motifs, the MEME 
tool MAST (35) was used to estimate the maximal 



difference between the total number of peaks containing 
a motif and the number that could be explained by chance 
within a range of stringencies (£- values). See Methods in 
Supplementary Data for generation of MAST curves. 



RESULTS 

Global properties of DNasel HS sites 

Classification of DNasel HS sites. We classified DNasel 
HS sites according to their occurrence rates: cell type 
specific, found in one out of 29 cell lines; common, 
found in 2 to 28 cell lines or ubiquitous, found in all 29 
cell lines (Supplementary Table S3). 

In the erythroleukemia cell type K562, 9% of the 
DNasel HS sites were found to be cell type specific, 
while the remaining 77% were common and 14% were 
ubiquitous (Figure 1A). In the strongest DNasel HS 
sites (top 20%), 58% were found to be ubiquitous and 
only 1% of them to be cell type-specific (Figure 1A). In 
contrast, in the weakest DNasel HS sites (bottom 20%), 
we found that 19% were cell type-specific and almost none 
were ubiquitous (Figure 1A). Similar observations were 
obtained from other cell lines (Supplementary Figure 
S2). Taken together, these findings indicated that ubiqui- 
tous DNasel HS sites are extremely hypersensitive to 
DNasel digestion, while cell type-specific DNasel HS 
sites are less susceptible to digestion, although still signifi- 
cantly more susceptible than the genome average. This is 
expected since ubiquitous DNasel HS sites are accessible 
in all 29 cell types analyzed, and cell type-specific DNasel 
HS sites are only accessible in one cell type. 

Genome-wide coverage of DNasel HS sites represented. To 
determine whether the majority of DNasel HS sites that 
exist in the human genome were represented in the data 
sets under examination, we computed the cumulative per- 
centage of the genome covered by DNasel HS sites and 
the cumulative numbers of DNasel HS sites with respect 
to the number of cell lines tested [Methods in 
Supplementary Data; described in (26)]. Therefore, as add- 
itional cell lines were included in the analysis, the total 
number of DNasel HS sites being investigated increased; 
ultimately, we examined approximately 900 000 DNasel 
HS sites from 29 cell lines (Figure IB). Accordingly, the 
total percentage of base pairs (bps) in the human genome 
covered by DNasel HS sites was calculated at 8.28%. A 
total of 18 943 ubiquitous and 10100 cell type-specific 
DNasel HS sites were identified from the 29 cell types, 
which covered 0.67 and 0.07% of bp in the human genome, 
respectively. However, even after addition of the 29th cell 
line we were unable to reach a significant saturation level 
of total DNasel HS sites, which would have been repre- 
sented by equal levels of cell type specific, common and 
ubiquitous DNasel HS sites. This finding was consistent 
with previously published results using only six cell types 
and examining 1% of the human genome (26), and sup- 
ported the suggestion by those authors that substantially 
more cell lines/ types must be included in future analysis to 
identify the majority of the DNasel HS sites that exist in 
the entire human genome. 
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Figure 1. Identification and characterization of DNasel HS sites. (A) Genome-wide distribution of DHSs relative to cell type. Total numbers of cell 
type specific, common and ubiquitous DHSs are indicated. The proportions of DHSs in the K562 cell line among the strongest scoring (top 20%) 
and weakest scoring DHSs (bottom 20%) are shown. (B) Cumulative percentage of the genome and cumulative numbers covered by DNasel HS sites 
from increasing numbers of cell lines (x-axis). Cumulative percentage of the genome (solid lines) and numbers (dashed lines) covered by all (red), cell 
type specific (green) and ubiquitous (blue) DNasel HS sites from any cell line. Each point represents an averaged value of all possible cell line 
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Genome-wide localization of DNasel HS sites. To obtain 
an overall picture of the DNasel HS site distribution 
pattern relative to genes, we divided the human genome 
into four regions: proximal promoter, 1 kb upstream and 
downstream of the TSS; exon; intron and intergenic re- 
gions. Groups were assigned based on the 'GENCODE' 
annotation published in the UCSC Genome Browser. 

In the K562 cell line, only 14 and 9% of all DNasel HS 
sites mapped to proximal promoters and exons, respect- 
ively, while 24% mapped to introns (Figure 1C). The 
majority (53%) mapped within intergenic regions. 
However, we discovered that over 20% of DNasel HS 
sites located within intergenic region, on average, are 
associated with strong Pol II signals; this correlation 
indicated these DNasel HS sites may in fact represent a 
promoter, exon or intron of an unannotated gene 
(Supplementary Data and Supplementary Figure S3). 
The genome-wide location results were well agreed with 
previously published results in human islets (27) and 
CD4 + T cells (25). Of note, the DNasel HS sites found 
to be located within promoters and exons were over 2-fold 
larger in size and twice as hypersensitive than those located 
in the introns or intragenic regions (Supplementary 
Table S4). When we considered only the cell type-specific 
DNasel HS sites, we found that ~9% were located in 
proximal promoters or exons. In stark contrast, nearly 
one-half of the ubiquitous DNasel HS sites were located 
in proximal promoters or exons (Supplementary Table 
S5). Similar distribution patterns were observed in other 
cell lines (Supplementary Tables S4 and S5). These 
findings, when considered along with the results from 
our analysis of overlapping CpG islands and sequence 
conservation within those sites (Supplementary Figures 
S4 and S5; Methods in Supplementary Data), indicate 
that ubiquitous DNasel HS sites are generally associated 
with promoters of so-called 'housekeeping' genes. 

Gene density and binding sites associated with DNasel HS 
sites. To examine the correlation of DNasel HS sites with 
gene density in the surrounding regions and presence of 
TFBSs, we segmented each chromosome into 2Mb 
windows and counted the numbers of DNasel HS sites 
and genes and binding sites within each. In general, the 
DNasel HS sites were found to strongly correlate with 
genes in all of the cell lines examined (R 2 = 0.9122 in 
K562; Figure ID). Additionally, the DNasel HS sites 
were highly correlated with Pol II in each cell line 
(R 2 = 0.9553 in K562; Figure ID). These results were con- 
sistent with observations by others that the open chroma- 
tin state is affiliated with gene dense regions in the genome 
(36). Interestingly, we found that DNasel HS sites cor- 
related with TFBSs, as evidenced by analysis of the 
CTCF and c-Myc binding sites (R 2 = 0.9052 and 
R 2 = 0.9383, respectively; Figure ID). This finding 



confirmed the Ling et al. (28) study for a larger set of 
TFs that binding sites show up to 90% overlap with 
DNasel HS sites. The property of DNasel HS site distri- 
bution is consistent with its role at identification of TFs 
and suggests a widespread function of DNasel HS sites in 
the genome. 

Genome-wide correlation of DNasel HS sites and histone 
modifications 

Histone modifications near and in DNasel HS sites. To 
characterize the histone modification patterns at DNasel 
HS sites, we aligned the DNasel HS sites of each group 
(cell type specific, common, ubiquitous) and com- 
pared each with the histone modifications of that region. 
Methylation and acetylation were examined, as distinct 
forms of each have been associated with activation, re- 
pression or both according to context. For example, 
the H3K27me3 methylation modification has been 
reported as being present at sites of repressed a gene 
expression (19). 

As shown in Figure 2, all three states of H3K4 methy- 
lation and acetylation of both H3K9 and H3K27 were 
sharply elevated in different types of DNasel HS sites. 
H3K9mel and H4K20mel were modestly elevated in cell 
type specific and common DNasel HS sites, and the ele- 
vated levels of these two marks were diminished in ubi- 
quitous DNasel HS sites (Figure 2). We did not observe 
elevated levels of trimethylation of either H3K27 or 
H3K36 in areas surrounding the DNasel HS sites 
(Figure 2). The tag densities of these histone modifications 
in ubiquitous DNasel HS sites were much higher than 
those in cell type-specific DNasel HS sites, except for 
H3K4mel and H3K27me3. The percentages of various 
histone modifications overlapping with DNasel HS sites 
from each of the different types are shown in the inset of 
Figure 2. Only a small fraction of the cell type-specific 
DNasel HS sites overlapped with H3K4me2, H3K4me3, 
H3K9ac and H3K27ac; in contrast, about one-half of 
cell type-specific DNasel HS sites overlapped with 
H3K4mel, H3K9mel, H3K27me3 and H4K20mel 
(Figure 2A). Nearly 70% of the ubiquitous DNasel HS 
sites overlapped with all of the histone modifications 
(Figure 2C), and of these, about one-fourth were located 
in proximal promoters. 

We next examined the enrichment/depletion of histone 
modification profiles at each type of the DNasel HS sites. 
We were intrigued to discover that both active and re- 
pressive histone modifications corresponded to the deple- 
tion at the peak signal of ubiquitous DNasel HS sites, 
albeit to differing extents (Figure 2 and Supplementary 
Table S6). However, we did not observe similar modified 
histone troughs for the cell type-specific DNasel HS sites, 
except in the cases of H3K9 and H3K27 acetylation 
(Figure 2; Supplementary Table S6). This analysis 



Figure 1. Continued 

combinations. (C) The genome-wide distribution of DNasel HS sites in proximal promoters (defined as 1 kb upstream and downstream of TSS), 
exons, introns and intergenic regions. Total numbers of DHSs relative to gene annotations are shown for the representative K562 cell line. (D) The 
densities of DNasel HS sites and transcription factor binding sites on each chromosome. The points plotted on the x- and j-axes represent the 
number of binding sites/2 Mbp and the number of DNasel HS sites/2 Mb, respectively. 
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Figure 2. Histone modifications proximal to the DNasel HS sites in K562 cell type. Histone modification profiles of cell type specific (A), common 
(B) and ubiquitous (C) DNasel HS sites. The tag density for modifications is shown across the DNasel HS sites and extending 5kb upstream of the 
dxStart and downstream of the dxEnd. Inset shows the percentage of DNasel HS sites of each different type overlapping with histone modifications. 
The numbers on the x-axis correspond to: 1, H3K4mel; 2, H3K4me2; 3, H3K4me3; 4, H3K9ac; 5, H3K9mel; 6, H3K27ac; 7, H3K27me3; 8, 
H3K36me3; 9, H4K20mel. The vertical dashed line indicates the peak of DNasel HS signal within each DNasel HS site. 

(continued) 
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Figure 2. Continued 



provided supportive evidence that the DNasel HS peaks 
within ubiquitous DNasel HS sites are nucleosome 
depleted. Noticeably, we found that the enrichment /deple- 
tion at DNasel HS sites was strongly correlated with 
the level of DNasel HS (Supplementary Figure S6 and 
Supplementary Table S7). Moreover, the correlation of 
the enrichment/depletion at the ubiquitous DNasel HS 
sites was much stronger than that found at the cell 
type-specific DNasel HS sites. Trimethylation of H3K27 
was the only outlier to this trend, suggesting that the 
degree of nucleosome depletion is related to the level of 
DNasel HS. 

Correlation between DNasel HS and histone 
modifications. To examine the correlation between 
DNasel HS sites and the different histone modifications 
across the entire human genome, we separated cell type 
specific, common and ubiquitous DNasel HS sites into 
100 groups for each, based on level of DNasel hyper- 
sensitivity. These groups were then plotted against 
their average modification levels in DNasel HS sites 
(Figure 3; see 'Materials and Methods' section). 

For the cell type-specific DNasel HS sites, we observed 
a strongly negative correlation between H3K27me3 
and DNasel hypersensitivity, but a strongly positive cor- 
relation between DNasel hypersensitivity and all other 
histone modifications (Figure 3A and Supplementary 
Table S8). Similar correlations between histone modifi- 
cations and DNasel hypersensitivity were found for 



the common DNasel HS sites, with the exception of a 
weak correlation with H3K36me3 (Figure 3B and 
Supplementary Table S8). In ubiquitous DNasel HS 
sites, however, modest positive correlations between 
DNasel hypersensitivity and H3K4me2, H3K4me3 and 
H3K9ac were detected and a weak correlation was 
found with H3K27ac (Figure 3C and Supplementary 
Table S8). In this type of DNasel HS sites, DNasel hyper- 
sensitivity was negatively correlated with H3K4mel, 
H3K9mel, H3K27me3, H3K36me3 and H4K20mel; the 
H3K36me3 was most negatively correlated, followed in 
descending order by H4K20mel, H3K9mel, H3K27me3 
and H3K4mel (Figure 3C and Supplementary Table S8). 

Genome-wide correlation of DNasel HS sites and gene 
expression 

DNasel HS proximal to TSSs. To identify the general dis- 
tribution pattern of DNasel HS sites near and around 
TSSs, we generated composite profiles (n = 1000 each) 
of the 1000 most active, 1000 median and 1000 least 
active genes. The genomic region that was analyzed en- 
compassed the entire defined gene body (exons and 
introns) and extended 5 kb upstream and 5 kb downstream 
of the 5'- and S'-boundaries (Figure 4A). Numbers of tags 
in the gene body were quantitated in windows representing 
10 equal parts, and in the 5'- and 3 / -proximal regions in 
0.2 kb windows; total numbers for each window were then 
summed to obtain to the overall methylation level for each 
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Figure 3. Correlation between DNasel HS and histone modifications in K562 cell type. Cell type specific (A), common (B) and ubiquitous (C) 
DNasel HS sites were grouped among 100 sets (dot) based on their hypersensitive levels (from high to low, left to right on the x-axis). The average 
tag density of DNasel hypersensitive and of histone modifications was calculated for each group and plotted according to the average tag density of 
DNasel hypersensitive (right y-axis) and the histone methylation (left y-axis). 
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gene. These numbers were then normalized by the total 
number of base pairs in each region. It was notable that 
the DNasel hypersensitivity signal peaked near both the 
5'- and Spends. As such, this signature may represent a 
useful method by which to confirm annotated TSSs, to 
identify novel TSSs or to determine alternative TSSs func- 
tioning in particular cell types (37). In addition, the 
DNasel hypersensitivity level of active genes was consist- 
ently much higher than that of silent genes, suggesting that 
DNasel hypersensitivity is associated with transcriptional 
activation. 

We then examined the distribution of DNasel HS sites 
that overlapped these genes. As shown in the inset of 
Figure 4A, 849 of the most active genes were found to 
be associated with DNasel HS sites, whereas only 421 of 
the least active genes were similarly associated. Of those 
most active genes, 258 (>30%) associated with ubiquitous 
DNasel HS sites, while only 22 (~5%) of the most silent 
genes associated with ubiquitous DNasel HS sites. 
Association with cell type-specific DNasel HS sites was 
found with 22% (189) of the most active genes and 
>30% (148) of the most silent genes, respectively. These 
results suggested that ubiquitous DNasel HS sites are 
associated with gene activation, whereas cell type-specific 
DNasel HS sites are associated with gene repression. 

DNasel HS and gene expression. To investigate whether 
the DNasel hypersensitivity level was correlated with gene 
expression level, we grouped the genes into 100 gene sets 
according to expression level. The sets were then plotted 
against the DNasel hypersensitivity levels in the 
transcribed regions (Figure 4B). This analysis indicated 
that a strong positive correlation exists between the level 
of DNasel hypersensitivity and gene expression 
(7? = 0.8713, P = 4.73E-32). We then used the same 
method to assess the correlation of gene expression with 
the level of DNasel hypersensitivity within the different 
types of DNasel HS sites (Supplementary Figure S7), and 
found that the positive correlation remained for each. Cell 
type-specific DNasel HS sites were more robustly 
correlated with gene expression than were the ubiquitous 
DNasel HS sites, suggesting that cell type-specific DNasel 
HS sites are involved in cell type-specific gene regulation 
(Supplementary Figure S7). 

Four distinct modes of chromatin domains 

DNasel HS related to both histone modifications and gene 
expression. To further clarify whether the relationship 
between the DNasel HS sites and histone modifications 
correlates with gene expression, we compared the DNasel 
HS sites and histone modifications with gene expression 
levels of 17 751 genes. We generated an image plot to 
determine the average signals of active and repressive 
histone modifications relative to the tag density of 
DNasel hypersensitivity and gene expression levels 
(Materials in Supplementary Data). Trimethylation of 
H3K4 and H3K27 were examined as representatives of 
active and repressive histone modifications, respectively. 
The results further confirmed previous observations that 
DNasel hypersensitive signal is strongly positively 



correlated with active histone modifications and inversely 
correlated with repressive histone marks (Supplementary 
Figure S8). The genes with higher levels of the H3K4me3 
signal tended to be expressed at higher levels, while the 
presence of H3K27me3 signals tended to correlate with 
decreased levels of expression. This was also consistent 
with previously published findings by others in which 
gene expression was significantly associated with 
presence of histone modifications (7,19). The interrelated- 
ness of DNasel hypersensitivity, gene expression and 
histone modifications indicated that active histone modi- 
fications generally correlate with the open chromatin state 
and active gene expression, whereas repressive histone 
marks indicate closed chromatin and gene silencing 
(Supplementary Figure S8). 

H3K4me3 and H3K27me3 are referred to as a pair of 
'active-repressive' modifications with regard to their 
effects on gene activity (7). The overlapping islands of 
H3K4me3 ('active') and H3K27me3 ('repressive') histone 
modifications are defined as 'bivalent domains', which 
have been implicated in the development and differenti- 
ation of mammalian embryonic stem cells and 
differentiated cells (7,21,22,38-41). By counting the total 
numbers of H3K4me3 and H3K27me3 modifications 
within each of the different types of DNasel HS sites 
from the K562 cell line, we determined that, 15 and 
45% of all DNasel HS sites were associated with 
H3K4me3 and H3K27me3, respectively (Figure 5A). 
These numbers were consistent with other cell lines 
examined (Supplementary Figure S9). Intriguingly, 
>10% of all DNasel HS sites were associated with both 
H3K4me3 and H3K27me3 modifications in multiple cell 
lines of different origins (Figure 5A and Supplementary 
Figure S9), indicating that bivalent domains are a wide- 
spread phenomenon in mammalian cells and suggesting 
that the 'active-repressive' switch is functionally relevant. 
However, recent studies suggested that bivalent marks, as 
described for mammalian embryonic stem cells, do not 
exist in Xenopus embryos (42,43). Future studies in differ- 
ent model organisms at various developmental stages are 
essential to elucidate the curious case of the occurrence or 
absence of bivalent marks (43). 

Examining the composition of these DNasel HS sites 
showed that over one-third of those with H3K4me3 alone 
or with both 3K4me3 and H3K27me3 were of the ubiqui- 
tous type (Figure 5 A and Supplementary Figure S10). 
However, only 7% of the DNasel HS sites overlapping 
with either H3K4me3 alone, both 3K4me3 and 
H3K27me3 or H3K27me3 alone, were of the cell type- 
specific type (Figure 5 A and Supplementary Figure S10). 
We then sought to determine whether the signals of 
DNasel hypersensitivity, gene expression and H3K4me3 
and H3K27me3 significantly differed among the DNasel 
HS sites that were associated with H3K4me3 alone, both 
H3K4me3 and H3K27me3 or H3K27me3 alone, or not 
associated with H3K4me3 or H3K27me3 at all. As 
shown in Figure 5B, the highest levels of gene expression, 
DNasel hypersensitivity and H3K4me3, and the lowest 
levels of H3K27me3 were associated with DNasel HS 
sites that overlapped with H3K4me3 alone. The lowest 
levels of gene expression, DNasel hypersensitivity and 



Nucleic Acids Research, 2011, Vol. 39, No. 17 7437 



A 25 




Distance from gene 



B 0-12 



0.1 



0.08 



■ 0.06 



0.04 



0.02 




-DNasel HS 
■Gene expression 



14 



12 



10 



Q 

M— 

o 



20 



40 



60 



80 



100 



Figure 4. Correlation between DNasel HS and gene expression in K562 cell type. (A) Profiles of DNasel HS sites across the gene bodies of the most 
active, median or most silent genes in = 1000 each group). Inset shows the percentage of DNasel HS sites of each different type that composed the 
three gene groups. The numbers on the x-axis correspond to: 1, cell type specific; 2, common; 3, ubiquitous. (B) DNasel HS sites were grouped into 
100-gene sets (dot) based on their expression levels (high to low, left to right on the x-axis). The average tag densities of the DNasel HS within the 
gene body and the average gene expression level were calculated for each group and plotted according to the average gene expression level (left 
y-axis) and the average tag density of DNasel hypersensitive (right j-axis). 
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H3K4me3, and the highest levels of H3K27me3 were 
associated with DNasel HS sites that overlapped with 
H3K27me3 alone. The DNasel HS sites that overlapped 
with both H3K4me3 and H3K27me3 were characterized 
by high levels of gene expression, DNasel hypersensitivity, 
and H3K4me3 and H3K27me3. 

Four distinct functions of chromatin structure. Based on 
the relationship of DNasel hypersensitivity with active 
and repressive histone modifications and gene expression, 
we theorized that at least four major functional modes 
existed for the different chromatin domain structures 
observed in the human genome across different cell 
types (Figure 5C). (i) Active chromatin domains were 
characterized by relatively higher levels of active histone 
modifications, DNasel hypersensitivity and gene expres- 
sion, and lower levels of repressive histone modifications 
(Figure 5B; Region I in Figure 5C), (ii) Silent chromatin 
domains were characterized by lower levels of active 
histone modifications, DNasel hypersensitivity and gene 
expression and higher levels of repressive histone modifi- 
cations (Figure 5B; Region II in Figure 5C), (iii) The 
bivalent chromatin domains were characterized by high 
levels of both active and repressive histone modifica- 
tions, and high levels of DNasel HS and gene expression 
(Figure 5B; Region III in Figure 5C) and (iv) The primed 
chromatin domains are characterized by patterns of active 
histone modifications and gene expression similar to the 
active chromatin domains, but also have similar patterns 
to the repressive chromatin domains of repressive histone 
modifications and DNasel hypersensitivity (Figure 5B; 
Region IV in Figure 5C). 

Identification of CTCF binding sites 

Next, we investigated whether the specific correlation 
pattern between the data from DNase-seq and ChlP-seq, 
in combination with gene expression from microarray 
analysis, could be used to predict transcription factor 
binding sites (TFBSs) in the human genome. 

In vertebrates, the CTCF, a ubiquitously expressed 1 1 
zinc finger (ZF) protein (44,45), is necessary for insulator 
element function (46-48). To characterize how the CTCF 
binding sites are distributed along the human genome, we 
performed computational meta-analysis of the DNase-seq 
and ChlP-seq data to identify potential CTCF binding 
sites. We examined the DNasel HS sites associated 
with CTCF ChlP-enriched regions across cell lines 
(Supplementary Table S9). Of the 35 307 DNasel HS 
sites that overlapped with CTCF enriched regions in cell 
type K562, 13 007 were distal (>1 kb) to an annotated TSS 
or RNA Pol II signal. Then, we applied the de novo motif 
finder MEME to identify CTCF binding sites within these 
13 007 distal DNasel HS sites. 

Our MEME analysis revealed that the CTCF consensus 
DNA binding motif was enriched in these DNasel HS 
sites (Figure 6A). Furthermore, the consensus motif was 
identical in each of the cell types studied (Supplementary 
Figure SI 1) and was consistent with previous findings 
from other cell lines (human islets, CD4 + T cells, HeLa 
cells and Jurkat cells) (27,49). Most of the distal DNasel 



HS sites (10 397 out of 13 007, 79.9%) contained at least 
one consensus motif (Figure 6B). Of those, 149 (1.4%) 
were cell type specific, whereas 7969 (76.7%) and 2279 
(21.9%) were common and ubiquitous, respectively. This 
agreed well with previous observations that CTCF 
binding in insulator regions is similar across diverse cell 
types (50). In a statistical context, the consensus motif 
explains 65% (8456 out of 13 007) of the distal DNasel 
HS sites after accounting for motifs that are expected 
to occur by chance. Compared to using DNasel-Seq 
data (29%, 39 951 out of 139121; Supplementary 
Figure S12A) or ChlP-seq data (57%, 46 328 out of 
81 688; Supplementary Figure S12B) alone, our result 
illustrated the high accuracy of CTCF binding sites dis- 
covery based on the integrative data (Figure 6B). Most 
often, these distal DNasel HS sites contained only a 
single motif (Figure 6C). 

The canonical motif was highly located on the peak of 
the DNasel HS signal within each DNasel HS site 
(Supplementary Figure S13A), which was consist with 
previous finding (27). This indicated that these peaks 
serve as the point of contact by the protein in vivo. Not 
surprisingly, we also found that the identified motif is 
located far distally from the nearest upstream or down- 
stream associated genes (Supplementary Figure S13B), a 
finding that would explain the widespread and fundamen- 
tal role of CTCF. Although the identified motif appeared 
to represent the major CTCF binding sequence, a signifi- 
cant number of the sites lacked the consensus sequence. 
Another study recently found that CTCF can bind to 
genomic regions that apparently lack the defined motif 
(51); this may be a result of DNA-CTCF interactions 
mediated by contacts with distinct arrangements of 
CTCF's 11-ZFs (44,48,52-54). 

DISCUSSION AND CONCLUSION 

Global properties of DNasel HS sites 

The dynamic and complex regulatory elements of gene 
transcription that underlie every biological process, from 
cell type-specific functions to systemic response to the en- 
vironment, remain to be completely defined or under- 
stood. Whole-genome mapping of DNasel HS sites has 
provided crucial clues to regions of transcriptional regu- 
lation. By combining these data with data from ChlP-seq 
and gene expression microarray experiments, we may gain 
a better understanding of this process. To this end, we 
performed a meta-analysis using each of these datasets 
that are publically available. By classifying the DNasel 
HS sites according to their characteristic of cell type 
specific, common or ubiquitous, we were able to identify 
approximately 900 000 DNasel HS sites from 29 diverse 
cell lines. These sites of presumed transcriptional regula- 
tory function encompassed 8.28% of the human genome, 
which was in agreement with the proportion reported in a 
related study (26). Detailed examination of all the DNasel 
HS sites in each of the 29 cell lines revealed that only 
approximately 10 000 (~7%) are cell type specific and ap- 
proximately 19 000 (~14%) were ubiquitous, covering 
0.67 and 0.07% of the human genome, respectively. 
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Figure 6. Motif analysis of DNasel HS sites. (A) Significantly enriched CTCF consensus motifs are graphically depicted using Weblogos. (B) MAST 
curves. Horizontal axis represents the E-value, the number of peaks expected to contain a given motif by chance. The vertical axis represents 
the number of peaks identified. The red curve (peaks with motifs) represents the number of peaks containing a given motif at the corresponding 
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The percentage of cell type-specific DNasel HS sites was 
much smaller than that obtained previously by Xi et al. 
(26). This discrepancy may be due to the particular cell 
types examined in our study or the fact that we examined 
more different cell types in total. Nonetheless, even 
though 29 cell types were considered in our study, we 
were also unable to reach the statistical saturation point 
for the DNasel HS sites that are likely to exist throughout 
the human genome. This finding indicated that functional 
TFBSs make up a more significant percentage of the 
genome. 

By investigating the ubiquitous DNasel HS sites, we 
hoped to gain insight into the function of housekeeping 
genes and overall chromatin structures. Our location 
analysis of DNasel HS sites relative to annotated genes 
indicated that nearly 50% of the ubiquitous DNasel HS 
sites in these data sets were located in proximal promoters 
or exons. This analysis, when considered in tandem with 
the overlapping with CpG islands between DNasel HS 
sites and levels of sequence conservation among different 
cell lines (Supplementary Figures S4 and S5), indicated 
that ubiquitous DNasel HS sites are generally associated 
with housekeeping promoters of genes. The strong correl- 
ation of DNasel HS sites with TFBSs revealed that 
DNasel HS sites are highly restricted to gene regions and 
their transcriptional regulatory elements. This provides 
further proof that DNasel HS sites in the genome are 
valuable markers of TF binding regions. 

Correlations among DNasel HS, histone modifications 
and gene expression 

Other groups have noticed sharp declines in the presence 
of active histone marks near TSSs, termed as nucleosome 
depleted regions (7,25,55). Interestingly, we observed that 
both active and repressive histone modifications are char- 
acterized by modified histone troughs, which are centered 
at DNasel HS peaks of ubiquitous DNasel HS sites, sug- 
gesting that these regions are nucleosome depleted. The 
degree of nucleosome depletion is undoubtedly related to 
chromatin accessibility and level of DNasel hypersensitiv- 
ity (Supplementary Figure S6; Supplementary Table S7). 
Our finding extends previous observations made in yeast, 
flies and mammalian systems, particularly the human, that 
nucleosome depletion is a general characteristic of active 
promoters. Furthermore, our finding provides supportive 
evidence that nucleosome depletion is associated with ubi- 
quitous DNasel HS sites and the ubiquitous DNasel HS 
sites are nucleosome depleted. 

Both cell type specific and ubiquitous DNasel HS sites 
are generally positively correlated with active histone 
modifications H3K4me2/3, H3K9ac, H3K27ac and nega- 
tively correlated with the repressive modification 
H3K27me3. Interestingly, monomethylations of H3K4, 
H3K9, H3K36 and H4K20 display a more complex func- 
tional relationship with chromatin, as they are positively 
correlated with cell type-specific DNasel HS sites and 
negatively correlated with ubiquitous sites. Correlation 
of DNasel hypersensitivity with gene expression suggests 
that DNasel hypersensitivity is highly correlated with 
transcriptional activation. Ubiquitous DNasel HS sites 



were associated with active genes, while cell type-specific 
DNasel HS sites were correlated with silent genes. 
Correlations of DNasel hypersensitivity, histone modifi- 
cations and gene expression indicated that active histone 
modifications generally correlated with active gene expres- 
sion and the open (accessible) chromatin state, whereas 
repressive histone marks correlated with gene silencing and 
tightly packed (inaccessible) chromatin state (Figure 5C). 
These specific correlations were summarized to reveal four 
distinct modes of chromatin domain function: repressive, 
active, primed and bivalent. These modes of association, 
which are much more finely described than the traditional 
classification of heterochromatin (open) or euchromatin 
(closed), provide insights into the complex structure and 
function of chromatin. 

Identification of regulatory elements via an integrative 
approach 

After systematic exploration of the relationships between 
data emanating from DNase-seq, ChlP-seq and gene ex- 
pression microarrays, we were able to identify transcrip- 
tional regulatory elements by using the de novo motif 
finder MEME. Our limited MEME analysis of the CTCF 
binding sites served to illustrate the high specificity and 
accuracy in identification of transcriptional regula- 
tory elements using the integrated data sets (Figure 6B 
and Supplementary Figure SI 2). The advent of high- 
throughput methods, including DNase-chip/DNase-seq, 
ChlP-chip/ChlP-seq and gene expression arrays, has en- 
couraged large storage databases of related data and pub- 
lic availability for more significant analysis. Combining 
distinct genome scale data in meta-analysis approaches 
represents the next era of genomic research. 

Our integrative analysis presented herein can be con- 
sidered as first-order data integration. The simplicity of the 
overlapping approach to determine regions and character- 
istics of enriched domains from DNase-seq and ChlP-seq 
data, along with the motif discovery and Gene Ontology 
strategies, represent an efficient means by which to 
perform integrative analysis. The next higher order, com- 
prehensive, integrative analysis should integrate diverse 
high-throughput sequence tags directly. Additionally, 
interactome data (56) can be applied to facilitate identifi- 
cation of the target genes of DNasel HS sites, since TFs 
and their three-dimensional interactions are crucial to 
gene regulation (57,58). Technologies based on chromo- 
some conformation capture (3C) (59), circularized 
chromosome confirmation capture (4C) (60), carboncopy 
chromosome confirmation capture (5C) (61) and the 
recently developed methods of Hi-C (62) and chromatin 
interaction analysis by paired-end tag sequencing 
(ChlA-PET) (63,64), can be used to provide a high- 
resolution genome-wide map of long-range genomic inter- 
actions. Integrating a wide variety of genomic data sets, 
including genomes, epigenomes (55), transcriptomes (65) 
and interactomes (56) will significantly enhance our under- 
standing of the complex genomic systems of all life forms 
and enhance our efforts to understand and modulate 
health and optimize well-being (66). 
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